including data origin
This is my final project for the Data Management and Visualisation module within the Systems Neuroscience MSc. This project will be exploring a raw data set collected for skin cancer in 2018, this data set was made available for use on kaggle. This data set included 10015 patients which included their age, sex, and localization of skin cancer.
I am interested if there will be any trends present within the data which suggests a reliable way to potentially determine those more at risk of skin cancer.
To explore this my research questions are the following:
Is there a more common area/s of where skin cancer is located on the body, regardless of gender or age.
When considering the top five most common locations of skin cancer (obtained in visualisation 1), are there any noticable patterns present with the average age of either male or female patients?
What are the most common age groups for a skin cancer diagnosis, regardless of localization.
Each visualisation will explore the research questions in order.
# Importing of the data
df <- read.csv(here("data", "raw_data.csv"))
# to display the top of the data frame
head(df)
## lesion_id image_id dx dx_type age sex localization
## 1 HAM_0000118 ISIC_0027419 bkl histo 80 male scalp
## 2 HAM_0000118 ISIC_0025030 bkl histo 80 male scalp
## 3 HAM_0002730 ISIC_0026769 bkl histo 80 male scalp
## 4 HAM_0002730 ISIC_0025661 bkl histo 80 male scalp
## 5 HAM_0001466 ISIC_0031633 bkl histo 75 male ear
## 6 HAM_0001466 ISIC_0027850 bkl histo 75 male ear
# Organise the data so that it is grouped by localization and add a percentage column
skincancer <- df %>%
group_by(localization) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(percentage = scales::percent(perc))
# to see the new layout of the data set.
head(skincancer)
## # A tibble: 6 × 4
## localization n perc percentage
## <chr> <int> <dbl> <chr>
## 1 acral 7 0.000698 0.070%
## 2 genital 48 0.00479 0.479%
## 3 ear 56 0.00559 0.559%
## 4 hand 90 0.00898 0.898%
## 5 scalp 128 0.0128 1.277%
## 6 neck 169 0.0169 1.686%
Creating a bar chart to explore where the most common skin cancer localization are.
# bar plot
p <- ggplot(skincancer,
aes(x = localization, y = n, fill = percentage))
p + geom_bar(stat = "identity") +
# Manually input the colour of each bar
scale_fill_manual(values = c("gold3", "gold2", "gold1", "gold", "yellow2",
"yellow1", "yellow", "lightgoldenrod1", "khaki1",
"lightgoldenrod2", "lightgoldenrod", "palegoldenrod",
"lightgoldenrodyellow", "lightyellow1", "lightyellow")) +
# Add labels to the axis' and the plot
labs(title = "Patients Localization of Skincancer", x = "Localization of Skincancer", y = "Number of reported Cases") +
# Change the font to bold and italic
theme(axis.title = element_text(face = "bold.italic", colour = "black"), plot.title = element_text(face = "bold")) +
#flip the graph
coord_flip()
# add hover text to the graph
library(htmlwidgets)
p <- ggplotly(tooltip = c("text", "x", "y", "percentage"))
p
# save the interactive bar chart
htmlwidgets::saveWidget(as_widget(p), here("figs", "InteractiveBar.html"))
From the bar chart we can see that upper extremity, trunk, lower extremity, back and abdomen are the most common areas for skin cancer, within our patient group. The next visualisation will access if there are any patterns present when considering the average age and gender for each skin cancer location.
Creating a line chart to explore if there any noticable patterns present within the average age of either male or female patients
# you want to only include the relevant localizations of skin cancer.
# the gender 'unknown' will also be removed
age_sex_sc <- df %>%
select(localization, age, sex) %>%
filter(localization == c("upper extremity", "trunk", "lower extremity", "back", "abdomen")) %>%
subset(sex != "unknown") %>%
group_by(localization, sex) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
# To view the new format of the data
head(age_sex_sc)
## # A tibble: 6 × 3
## # Groups: localization [3]
## localization sex age
## <chr> <chr> <dbl>
## 1 abdomen female 44.5
## 2 abdomen male 52.8
## 3 back female 48.9
## 4 back male 55.1
## 5 lower extremity female 48.1
## 6 lower extremity male 54.1
# this ensures that the male and female lines will be in different colours.
colour_sex <- c("yellow1", "gold")
# line chart
aa <- ggplot(age_sex_sc, aes(x = localization, y = age, group = sex))
aa + geom_line(linetype= "longdash") +
# Add points for each male and female average age as different colours so they are easily distinguishable, also add a point in the centre.
geom_point(aes(fill=factor(sex)), size = 4, shape = 21, stroke = 0.5) +
geom_point(color = "#FFFFFF", size = 1) +
scale_fill_manual(values = colour_sex) +
# Add a boarder to the plot, ensuring its not too bold as it would distract from the findings.
theme(panel.border = element_rect(linetype = "solid", fill = "NA")) + theme(panel.grid.major = element_line(colour = "black", size = 0.2)) +
# Add titles to the plot to make it clear what the data is displaying. Edited so that the x and y titles are bold and in italics.
labs(title = "The Average Age of Patients with Skin Cancer", subtitle = "Seperated by Gender and most common skin cancer Localizations", x = "Localization of skin cancer", y = "Average age") + theme(axis.title = element_text(face = "bold.italic", colour = "black"), plot.title = element_text(face = "bold"))
aa <- ggplotly(tooltip = c("text", "x", "y"))
aa
# save the interactive bar chart
htmlwidgets::saveWidget(as_widget(aa), here("figs", "InteractiveLine.html"))
From this line chart we can clearly see that although age fluctuates per the different localizations of skin cancer, that there is a clear gender seperation present, whereby women throughout all localizations presented are on average younger than the males who have the same skin cancer.
Creating a population pyramid to explore the number of patients within each age group
# First we need to create a column for age groups, and group by this so that the number of patients diagnosed for each age group can be identified
df$AgeGroup <- cut(df$age,
breaks = c(0, 10, 20, 30, 40, 50, 60, 70
, 80, Inf),
labels = c( "0-9 years", "10-19", "20-29", "30-39"
, "40-49", "50-59", "60-69"
, "70-79", "80+"),
right = FALSE)
## The 'unknown' ages will be removed from the dataset, as well as unknown sex
agegroup <- df %>%
group_by(AgeGroup, sex) %>%
subset(AgeGroup != "unknown") %>%
subset(sex != "unknown") %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(percentage = scales::percent(perc)) %>%
# Add a further group_by function in order to obtain the total number of patients per age group to use in labelling the plot made next
group_by(AgeGroup, sex) %>%
mutate(Total_n = sum(n))
# To view the new data set layout
head(agegroup)
## # A tibble: 6 × 6
## # Groups: AgeGroup, sex [6]
## AgeGroup sex n perc percentage Total_n
## <fct> <chr> <int> <dbl> <chr> <int>
## 1 10-19 male 37 0.00372 0.372% 37
## 2 0-9 years female 49 0.00492 0.492% 49
## 3 0-9 years male 74 0.00743 0.743% 74
## 4 10-19 female 81 0.00813 0.813% 81
## 5 20-29 male 184 0.0185 1.848% 184
## 6 80+ female 195 0.0196 1.958% 195
# Create an animated population pyramid
# load the relevant libraries
library(ggpol)
library(gganimate)
pop <- ggplot(agegroup, aes(x = AgeGroup,
# use ifelse in order to seperate male and female as well as enabelling the axis to start at zero for both male and female
y = ifelse(sex == "male", n, -n),
fill = sex)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Total_n)) +
# Change the colours of the bars to clearly show a different colour for male and female
scale_fill_manual(values = c("yellow1", "gold")) +
# Add specific titles to the main title and axis
labs(title = "How the number of cases of skin cancer vary across age groups",
subtitle = "Separated by gender",
x = "Age Group of Patients",
y = "Number of patients") +
# Change the font size and bold and or italic for the speicifc titles and axis'
theme(plot.title = element_text(face = 'bold', size = 14),
axis.title.x = element_text(face = 'bold.italic', size = 12),
axis.title.y = element_text(face = 'bold.italic', size = 12),
axis.text = element_text(face = "bold", size = 10),
panel.grid.major = element_line(colour = "black", size = 0.2),
panel.grid.minor = element_blank(),
) +
# Add the animation to the population pyramid
transition_states(states = AgeGroup, transition_length = 2, state_length = 1) +
facet_share(~ sex, dir = 'h',
scales = 'free',
reverse_num = TRUE) +
enter_fade() +
exit_fade() +
coord_flip() +
ease_aes('cubic-in-out')
pop
# Save the animated plot
anim_save("pop.gif", animation = last_animation(), path = "figs")
From this population pyramid it can be observed that although there are many fluctuations throughout each age group that there does appear to be certain age groups which elicit a higher number of patients within a certain gender. For instance within the age group 50-59, 1037 cases were reported from females, whilst only 115 were reported for men. This could suggest that, within our data set at least, that those aged 50-59 and are female are at a higher risk of developing skin cancer.
From the visualisations presented there have been some interesting findings. The age gap between male and female participants when looking at the top five most common areas of cancer was an interesting finding to view. It suggested that men have a much higher average age for skin cancer, regardless of location when compared to female patients. As well as this when viewing the third visualisation it has suggested significant age groups where the different sex’ could be identified as being at a higher risk.
I would suggest that future research further explores survival rates and the relationship this has with both gender and age. I would also suggest that there is further work conducted on the gender separations present here and to potentially use the findings as a key in who to target particular skin cancer awareness to. Research would also be better provided on data that not just has a high number of reported cases but also looks at the data over several years.
This has been my first experience using R, and whilst it was incredibly daunting at first i have learnt to love the process of creating a project like this. There is nothing more infuriating than lines of code not working, however it is such an empowering feeling to solve the problem and get the finished product you were hoping for. I am definitely looking forward to my continued learning experience with R.